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Abstract 

Populations can be genetically isolated by geographic distance and by 
differences in their ecology or environment that decrease the rate of suc- 
cessful migration. Empirical studies often seek to investigate the relation- 
ship between genetic differentiation and some ecological variable(s) while 
accounting for geographic distance, but common approaches to this prob- 
lem (e.g. the partial Mantel test) have a number of drawbacks. In this 
article, we present a Bayesian method that enables users to quantify the rel- 
ative contributions of geographic distance and ecological distance to genetic 
differentiation between sampled populations or individuals. We model the 
allele frequencies in populations at a set of unlinked loci as spatial Gaussian 
processes, and model the covariance structure of pairs of populations as a 
decreasing function of both geographic and ecological distance between that 
pair. Parameters of the model are estimated using a Markov chain Monte 
Carlo algorithm. We have implemented this method, Bayesian Estimation 
of Differentiation in Alleles by Spatial Structure and Local Ecology {BE- 
DASSLE), in a user-friendly format in the statistical platform R, and we 
demonstrate its utility with a simulation study and empirical applications 
to human and teosinte datasets. 
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Introduction 



The level of genetic differentiation between populations is determined by the ho- 
mogenizing action of migration balanced against differentiating processes such as 
local adaptation, different adaptive responses to shared environments, and ran- 
dom genetic drift. Geography often limits dispersal, so that the rate of migration 
is higher between close populations and lower between more distant populations. 
The combination of local genetic drift and distance-limited migration results in 
a difference between allele frequencies that increases with geographic distance, a 



pattern of isolation by distance (Wright, 1943). Extensive theoretical work has 
described expected patterns of isolation by distance under a variety of models of 



genetic drift and migration ( Charlesworth et al. , 2003), in both equilibrium pop- 
ulations in which migration and drift reach a balance, and under non-equilibrium 
demographic models, such as population expansion or various scenarios of col- 



onization (Slatkin, 1993). A range of theoretical approaches have been applied. 



with authors variously modeling probabilities of identity of gene lineages (Malecot 



1975 Rousset, 1997), correlations in allele frequencies (Slatkin and Maruyama 



1975 Weir and Cockerham, 1984), or the structured coalescent (Hey, 1991; Nord- 



borg and Krone 2002). Although the details of these approaches are superficially 



different, they all operate on the premise that allele frequencies will be more similar 
between nearby populations than between distant ones. 

In addition to geographic distance, populations can also be isolated by dif- 
ferences in their ecology and environment that decrease the rate of successful 



migration through processes such as dispersal limitations (Wright, 1943), biased 



dispersal (Edelaar and Bolnick, 2012), or selection against migrants due to local 
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adaptation (Wright, 1943 Hendry, 2004). Thus, in an environmentally heteroge- 



neous landscape, genome-wide differentiation may increase between populations 
with both geographic distance and ecological distance. The relevant ecological 
distance may be distance along a single environmental axis, such as difference in 
average annual rainfall, or distance along a discrete axis describing some landscape 
or ecological feature not captured by pairwise geographic distance, such as being 
on serpentine versus non- serpentine soil, or being on different host plants. Isola- 



tion by distance has been observed in many species (Vekemans and Hardy, 2004 



Meirmans, 2012), with a large literature focusing on identifying other ecological 



and environmental correlates of genomic differentiation. 

The goals of these empirical studies are generally 1) to determine whether an 
ecological factor is playing a role in generating the observed pattern of genetic 
differentiation between populations and, 2) if it is, to determine the strength of 
that factor relative to that of geographic distance. The vast majority of this work 
makes use of the partial Mantel test to assses the association between pairwise 
genetic distance and ecological distance while accounting for geographic distance 



(Smouse et al. , 1986). 



A number of valid objections have been raised to the reliability and intepretabil- 



ity of the partial Mantel (Legendre and Fortin 2010 Guillot and Rousset, 2012). 



First, because the test statistic of the Mantel test is a matrix correlation, it as- 
sumes a linear dependence between the distance variables, and will therefore be- 



have poorly if there is a nonlinear relationship (Legendre and Fortin, 2010). Sec- 



ond, the Mantel and partial Mantel tests can exhibit high false positive rates when 
the variables measured are spatially autocorrelated, structure that is not accomo- 



dated by the permutation procedure to assess significance (Guillot and Rousset 



2012). Finally, in our view the greatest limitation of the partial Mantel test in its 



application to landscape genetics may be that it is only able to answer the first 
question posed above — whether an ecological factor is playing a role in gener- 
ating a pattern of genetic differentiation between populations — rather than the 
first and the second — the strength of that factor relative to that of geographic 
distance. By attempting to control for the effect' of geographic distance with 
matrix regressions, the partial Mantel test makes it hard to simultaneously infer 
the effect sizes of geography and ecology on genetic differentiation, and because 
the correlation coefficients are inferred for the matrices of post-regression residu- 
als, the inferred effects of both variables are not comparable — they are not in 
a common currency. We perceive this to be a crucial lacuna in the populations 
genetics methods toolbox, as studies quantifying the effects of local adaptation 
(e.g. (Rosenblum and Harmon, 2011[ )), host-associated differentiation (e.g. (Dres 



and Mallet, 2002 Gomez-Diaz et al. , 2010)), or isolation over ecological distance 



(e.g. (Andrew et al. , 2012 Mosca et al. , 2012)) all require rigorous comparisons 
to the effect of isolation by geographic distance. 

In this article, we present a method that enables users to quantify the relative 
contributions of geographic distance and ecological distance to genetic differenti- 
ation between sampled populations or individuals. To do this, we borrow tools 



from geostatistics (Diggle et al. , 1998) and model the allele frequencies at a set of 



unlinked loci as spatial Gaussian processes. We use statistical machinery similar 
to that employed by the Smooth and Continuous AssignmenTs (SCAT) program 



designed by (Wasser et al. , 2004) and the BayEnv and BayEnv2 programs de- 



signed by (Coop et al. , 2010) and (Giinther and Coop, 2013). Under this model 



the allele frequency of a local population deviates away from a global mean allele 



frequency specific to that locus, and populations covary, to varying extent, in the 
deviation they take from this global mean. We model the covariance structure 
of pairs of populations as a decreasing function of both geographic and ecological 
distance between them, so that populations that are closer in space or more similar 
in ecology may covary more in allele frequencies. 

The parameters of this model are estimated in a Bayesian framework using a 



Markov chain Monte Carlo algorithm (Metropolis et al. 1953 Hastings, 1970). 
We demonstrate the utility of this method with two previously published datasets. 
The first is a dataset from several subspecies of Zea mays, known collectively as 



teosinte (Fang et al. , 2012), in which we examine the contribution of difference in 
elevation to genetic differentiation between populations. The second is a subset 
of the Human Genome Diversity Panel (HGDP, ( [Conrad et al. 2006; Li et al 



2008)), for which we quantify the effect size of the Himalaya mountain range on 



genetic differentiation between human populations. We have coded this method 
— Bayesian Estimation of Differentiation in Alleles by Spatial Structure and Local 
Ecology [BEDASSLE) — in a user-friendly format in the statistical platform R 



(R Development Core Team, 2007), and the scripts are available for download at 
genes cape. org. 



Methods 
Data 

Our data consist of L unlinked biallelic single nucleotide polymorphisms (SNPs) 
in K populations; a matrix of pairwise geographic distance between the sampled 
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populations (D); and one or more environmental distance matrices (E). The ele- 
ments of our ecological distance matrix may be binary (e.g. same or opposite side 
of a hypothesized barrier to gene flow) or continuous (e.g. difference in elevation 
or average annual rainfall between two sampled populations). The matrices D 
and E can be arbitrary, subject to the constraint that D and E are nonnegative 
deflnite, which is satisfled if they are each matrices of distances with respect to 
some metric. We summarize our genetic data as a set of allele counts (C) and 
sample sizes (S), denoting by C^^k the number of observations of allele 1 at locus i 
in population k out of a total sample size of S^^k alleles. The designation of allele 
1 is arbitrary but consistent among populations at the same locus. 

Likelihood Function 

We model the data as follows. The Ce^k observed '1' alleles in population k result 
from a draw of size Si^k alleles from an underlying population in which allele 1 
at locus i is at frequency fi^k- The population frequencies /^^^ are themselves 
random variables, independent between loci, with covariances that are a function 
of pairwise geographic and ecological distance. A flexible way to model these 
covariances is to assume that the allele frequencies fi^k are multivariate normal 
random variables, inverse logit-transformed to lie between and 1. Specifically, 
we assume that ^ is a function of a transformed global mean allele frequency /i^ 
for locus i, and the deviation 9i^k from this mean in population k: 




1 



(1) 
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We can then write the binomial probabihty of the count of allele 1 at locus ^ in 
population k as 



In doing so, we are assuming that the individuals are out bred, so that our S(,^k 
alleles represent independent draws from this population frequency. We will return 
to relax this assumption later. 

To model the covariance of the allele frequencies across populations, we assume 
that O^^k are multivariate normally distributed, with mean zero and a covariance 
matrix VL that is a parametric function of the pairwise geographic and ecological 
distances between the sampled populations. We model the covariance between 
populations i and j as 



where Di j and E'jj are the pairwise geographic and ecological distances between 
populations i and j, respectively, and and ue are the effect sizes of geographic 
distance and ecological distance, respectively. The parameter is the variance of 
population specific deviate 9 (i.e. at Djj + E'jj = 0), and q;2 controls the shape of 
the decay of the covariance with distance. As alluded to above, as many separate 
ecological distance variables may be included as desired, each with its own 
effect size parameter, but here we restrict discussion to a model with one. 

With this model, writing a = {ao, q;d, a^, 0:2), the likelihood of the SNP counts 




(2) 




(3) 
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observed at locus £ in all sampled populations can now be expressed as 

K 

P{Ce,ee\Se,i^e,a) = P{ee\n{a))l[P{Ce,k\Si,k, /{Oe, i^e)) (4) 

fe=i 

where we drop subscripts to indicate a vector (e.g. = (Ca, . . . , Qx)), and 
P{9e\^) is the multivariate normal density with mean zero and covariance ma- 
trix Q. 

The joint likelihood of the SNP counts C and the transformed population allele 
frequencies 9 across all L unlinked loci in the sampled populations is given by 

L K 

P{C,e\S, fi, a) = l[P{eema)) JJ P{Q,k\Se,k, fiOt, i^e)) (5) 
e=i k=i 

Posterior Probability 

We take a Bayesian approach to inference on this problem, and specify priors 
on each of our parameters. We denote a gamma distribution with given shape 
and rate parameters as F (shape, rate), a normal distribution with given mean and 
variance parameters as A'"(mean, variance), an exponential distribution with given 
rate parameter Exp (rate), and a uniform distribution between given upper and 
lower boundaries as t/(lower, upper). The priors specified on the parameters of 
this model are: ao ~ r(0.001, 0.001); ao ~ Exp(l); as ~ Exp(l); ^2 ~ C/(0.1,2); 
and He ~ iV(0, 1//9), with a hyper-prior /3 ~ r(l, 1). 
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The full expression for the joint posterior density is therefore given by 



P{9, fj,, ao, an, (Xe, oli-, ^\C, S) oc \e=i 




(6) 



xP{P)P{ao)P{aD)P{aE)P{a2) 



where the various P denote the appropriate marginal densities, and the proportion- 
ality is up to the normalization constant given by the right-hand side integrated 
over all parameters. 

Markov chain Monte Carlo 

We wish to estimate the posterior distribution of our parameters, particularly an 
and aE- As the integral of the posterior density given above cannot be solved 
analytically, we use Markov chain Monte Carlo (MCMC) to sample from the dis- 
tribution. Our MCMC scheme proceeds as follows. The chain is initiated at 
maximum likelihood estimates (MLEs) for 6 and n, and, for aQ,aij,aE, and 0:25 
at values drawn randomly from their priors. The empirical standard deviation of 
the MLEs of is used as the initial value of /3. 

In each generation one of {/i, (3,9,ao,aD,C(E,C(2} is selected at random to be 
updated. 

The priors on (3 and ccq are conjugate to their marginal posteriors, and each is 
updated via a Gibbs sampling step. The updated value of (3 given the current 11 
is drawn from 
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and the updated value of ao conditional on the current set of 6 is drawn from 

«o I ^1, ■ ■ ■ , ~ r ||l + ^, 1 + ^ ^ 9,,^X'X^ , (8) 

where k is the number of populations sampled, L is the number of loci sequenced, 
and X = Q^o^ = exp (— (a^-Dij + The proposed updates to 9 do not 

affect each other, and so are accepted or rejected independently. Following Wasser 



et al. (2004) (derived from (Christensen and Waagepetersen , 2002 M0ller et al. 



1998)), the proposal is chosen as 9'^ = Oi + RiZ, where i? is a vector of normally 
distributed random variables with mean zero and small variance (controlled by 
the scale of the tuning parameter on 9) and Z is the Cholesky decomposition of 
Q (so that ZZ^ = Q). Under this proposal mechanism, proposed updates to 9e 
tend to stay within the region of high posterior probability, so that more updates 
are accepted and mixing is improved relative to a scheme in which the 9 in each 
population were updated individually. 

Updates to an, aE, and ^2 are accomplished via a random-walk sampler 
(adding a normally distributed random variable with mean zero and small vari- 



ance to the current value) (Gilks et al. , 1996). Updates to elements of fie are also 



accomplished via a random-walk sampler, and again the updates to each locus are 
accepted or rejected independently. 



This algorithm is implemented in the statistical platform R (R Development 



Core Team, 2007). Default values of tuning parameters (variances in the proposal 
distributions for /i, 6^, a^, as, and 0:2) and the number of generations required to 
accurately describe the joint posterior will vary from dataset to dataset, and so 
may require tuning. As always when using MCMC, trace plots should be examined 
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to assess speed of convergence; for an excellent and more thorough description of 
MCMC inference, see (|Gilks et aL[[l996|. 



Model Adequacy 

Our model is a simplification of the potentially complex correlations present in 
the data, and there are likely other correlates of differentiation not included in the 
model. Therefore, it is important to test the model's fit to the data, and to high- 
light aspects of its performance that are inadequate. To assess model adequacy, we 
use posterior predictive sampling, with the set of pairwise population Fst values 
as a summary statistic. In posterior predictive sampling, draws of parameters are 
taken from the posterior and used to simulate new datasets, summaries of which 



can be compared to those observed in the original datasets (Gelman et al. , 1996). 

Our posterior predictive sampling scheme proceeds as follows. For each repli- 
cate of the simulations we 

1. Take a set of values of a and P from their joint posterior, i.e. our MCMC 
output. 

2. Compute a covariance matrix Q from this set of a and the pairwise geographic 
and ecological distance matrices from the observed data. 

3. Use Q to generate L multivariate normally distributed 6, and use P to gen- 
erate a set of normally distributed /i. These 6 and /i are transformed using 
equation Q into allele frequencies for each population-locus combination, 
and binomially distributed allele counts are sampled using those frequen- 
cies and the per-population sample sizes from the observed data, with the 
requirement that no loci are monomorphic. 
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4. Calculate Fst between each pair of populations across all loci. 

As our interest is primarily in understanding the fit of our model to the observed 
genetic distance between pairs of populations, we use pairwise Fst as the summary 
statistic of our sample frequency data, which we calculate as 

TP r ■\ 1 ^^^^^ 1^^^^ fn\ 

FsTi^,3) = l V rr. ^r, VI 

2^ [Pii+Pji)[i- 2^) 

1<£<L 

where L is the number of loci and pu and pj£ are the empirical allele frequencies 
in populations i and j at locus i (e.g. Ce^i/Si^i). We note that this estimator of 



Fst is not unbiased and that other estimators of pairwise Fst are available (Weir 



and Hill, 2002), but as Fst is calculated in the same manner for both real and 
simulated data, this choice is suitable as a descriptive statistic. 

We then use various visualizations of FsT^hj), e.g. plotted against distance 
between i and j, to compare the patterns in the observed dataset to the patterns 
in the simulated datasets. This functions as a powerful and informative visual 
summary of the ability of the model to describe the observed data. Users can assess 
how well the method is able to pick up general trends in the data (e.g. increasing 
genetic differentiation with ecological or geographic distance) and how well those 
general trends in the model match the slope of their observed counterparts, and 
also identify specific pairwise population comparisons that the model is doing a 
particularly poor job describing. These latter may help reveal other important 
processes that are generating genetic differentiation between populations, such as 
unmeasured ecological variables, or heterogeneity in population demography. 



13 



Accounting for overdispersion 

One obvious way in which our model is inadequate is the assumption that all of 
our populations are equally variable. Specifically, we have assumed that all pop- 
ulations have the same variance around the global mean, and only differ in the 
strength of their covariance with one another. There arc numerous biological pro- 
cesses that would invalidate this assumption, including local differences in inbreed- 
ing, admixture, bottlenecks, or colonization histories. Indeed, in both empirical 
datasets examined in this paper, there are clearly populations that show much 
more variability than predicted by simple geographic and ecological distances. 

Above, we assumed that the observed alleles in each population are indepen- 
dent draws from a population-specific allele frequency, so that all correlations 
between alleles in a given population are due to their shared allele frequency, 
and the strength of the correlation is the same in each population. However, in 
practice, the allele draws in some populations are more correlated than in oth- 
ers. A concrete way to see this would be to note that the normalized residuals 
{Ci,k - f{Oe,k,IJ'i)Se,k)/\/Si,kf{Oe,k,IJ'e){i - f{9i,k,IJ'i)), across loci and the poste- 
rior distribution, have larger variance in some populations than others, i.e. greater 
variance than we would expected given the same binomial sampling process in all 
populations. 

We account for this by introducing a population-specific inbreeding coefficient, 
Fk in population A;, and including independent priors on each Fk- Mathematically, 
we replace the fixed allele frequency f{0£^k, l-i'e) by a random, beta-distributed allele 

frequency q^k with parameters f{6e,k, f^e){'^-Fk)/Fk and {1- f{6i^k, l^e)){'^-Fk)/ Fk 
(which has mean f{6e,k,l^e) and variance f{9e^k, l^e){'^ - f{de,k,l^e))Fk/{l - Fk)) as 
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done in (Balding and Nichols, 1995, 1997)). The random frequency integrates out 



so that the likelihood of the count data becomes 



P{C£,k\Se,k, f£,k — f{0e,k, fJ'i)) 



l,k 



( fe,k{'^-Pk)/Fk \ 
^{-^-ft,k){l-Fk)/Fj 



(10) 



The inbreeding coefficient Ft can be interpreted as an F-statistic (Wright, 1943), 



summarizing the correlation in subsequent draws of alleles from within population 



k (i.e. an inbreeding coefficient) (Cockerham and Weir, 1986 Balding, 2003), or 
equivalently the variance of allele frequencies relative to expected from the geo- 
graphically predicted frequency for the /c*'^ population: = Var[g£ fc]//(6'^ fc, — 
fiOi,k,IJ'e))- 

To combine estimation of F^. into our inference framework we place an exponen- 
tial prior on Fk/{1 — Fk) ( ~ Exp(5)), putting more prior mass on low values of Fk- 
This prior and P{Ce^k\Se,k, fe,k = /(^'^.fe, /i^)) are incorporated into our posterior. 
In the MCMC, initial values of F^ are drawn from the prior for each population. 
Updates are proposed one population at a time via a random-walk step, and are 
accepted or rejected independently. 



Simulation Study 

We conducted a simulation study to evaluate the performance of the method. Each 
simulated dataset consisted of 30 populations each with 10 diploid individuals 
sequenced at 1000 polymorphic bi-allelic loci. Separately for each dataset, the 
geographic locations of the populations were sampled uniformly from the unit 
square, and geographic distances (-Dij) were calculated as the Euclidean distance 
between them. We also simulated geographically autocorrelated environmental 
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variables, some continuous, some discrete (see Figure [Tp and c). For both discrete 
and continuous variables we simulated datasets in which ecological distance had 
no effect on genetic differentiation between populations; these simulations tested 
whether our method avoids the false positive issues of the partial Mantel test. We 
also simulated datasets with an effect of both geographic and ecological distance 
on genetic distance across a range of relative effect sizes (varying the ratio aE/an) 
to test our power to detect their relative effects. The study thus consisted of 
four sections, each comprised of 50 datasets: discrete and continuous ecological 
variables, with or without an effect of ecology. 

For each dataset, we set = 0.5, and sampled ao and 02 from uniform distri- 
butions (f/(0.2,4) and f/(0.1,2) respectively); the choice of aE varied, depending 
on the specific scenario (described below). These parameters were chosen to give 
a range of pairwise population Fst spanning an order of magnitude between ap- 
proximately 0.02 and 0.2, and a realistic allele frequency spectrum. The covariance 
matrix Q was calculated using these a and the pairwise geographic and ecological 
distance matrices (normalized by their standard deviations), and Q was used to 
generate the multivariate, normally distributed 6. Values of fi were drawn from 
a normal distribution with variance /3 = 0.09. Allele frequencies at each locus 
were calculated for each population from the 9 and /i using equation ([T]), and 
SNP counts at each locus in each population were drawn from binomial distribu- 
tions parameterized by that allele frequency with the requirement that all loci be 
polymorphic. We simulated under the following ecological scenarios. 

1. Continuous, Autocorrelated Ecological Variable For the continuous 
case, we simulated the values of the ecological variable e across populations by 
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Simulated Populations: 
Continuous Ecological Variable 




Isolation by Distance 
Continuous Ecological Variable 




Longitude 



Pairwise Geographic Distance (Djj) 




Figure 1: a) Populations simulated in the unit square, colored by their value of a 
continuous ecological variable, e. b) Pairwise Fst between simulated populations 
from (a), colored by difference in their values of e. c) Populations simulated in the 
unit square, colored by their value of a binary ecological variable, e. d) Pairwise 
Fst between simulated populations from (c), colored by difference in their values 
of e. 



sampling from a multivariate normal distribution with mean zero and covariance 
between population i and population j equal to Cov {E (i), E{j)) = exp(— Dj^/ac), 



where Oc determines the scale of the autocorrelation (following ( Guillot and Rous 



set 



2012[ )). For all simulations, we set = 0.7, to represent a reasonably dis- 
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tributed ecological variable on a landscape. 



2. Binary Ecological Variable A binary variable was produced by declaring 
that the latitudinal equator in the unit square was a barrier to dispersal, so that 
all populations on the same side of the barrier were separated by an ecological 
distance of zero, and all population pairs that spanned the equator were separated 
by an ecological distance of 1. 

A. Zero Effect Size For each type of ecological variable, we produced 50 sim- 
ulated datasets with aE = 0, so that ecological distance had no effect on the 
covariance of 9, and hence on genetic differentiation between populations. For 
each of these simulated datasets, we performed a partial Mantel test in R using 



the package ecodist (Goslee and Urban, 2007) with 1,000,000 permutations 



B. Varying Effect Size We also produced 50 simulated datasets for each type 
of ecological variable by simulating ten datasets for each value of aEjo-D from 0.2 
to 1.0 in intervals of 0.2 (see Figure[T^ and d\ (As above, values of a/? were drawn 
from a uniform distribution (f/(0.2,4)), so this determines a^-) 

All analyses on the simulated datasets were run for 1,000,000 MCMC iterations, 
which appeared sufficient in most cases for convergence on the stationary distribu- 
tion. The chain was sampled every 1,000 generations, and all summary statistics 
from the simulation study were calculated after a burn- in of 20%. The metrics of 
method performance used were precision, accuracy, and coverage of the aE '■ old 
ratio. We defined 'precision as breadth of the 95% credible set of the marginal 
posterior distribution; accuracy as the absolute value of the difference between 
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the median value of the marginal posterior distributions and the values used to 
simulate the data in each dataset; and coverage as the proportion of analyses for 
which the value used to simulate the data fell within the 95% credible set of the 
marginal posterior distribution for that parameter. 

For approximately 30% of all analyses, the MCMC runs displayed obvious dif- 
ficulty with convergence within the first 1,000,000 generations. In some cases, this 
was because the naive scales of the various tuning parameters of the random-walk 
proposal mechanisms were inappropriate for the particular dataset, and mixing was 
too slow over the number of generations initially specified (as diagnosed by visual- 
izing the parameter acceptance rates of MCMC generations). This was addressed 
by re-running analyses on those datasets using different random-walk tuning pa- 
rameters, or by increasing the number of generations over which the MCMC ran. 
In the other cases, failure to converge was due to poor performance of the MCMC 
in regions of parameter space too near the prior boundaries. Specifically, when the 
chain was randomly started at values of some a parameters too close to zero, it was 
unable to mix out of that region of parameter space. This problem was addressed 
by re-running the analyses using different, randomly chosen initial values for the 
a parameters. 

Empirical Data 

To demonstrate the utility of this method, we applied it to two empirical datasets: 
one consisting of populations of teosinte {Zea mays), the wild progenitor of maize, 
and one consisting of human populations from the HGDP panel. Both processed 
datasets are available for download at genescape.org. See Tables SI and S2 in the 
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Supplementary Materials for names and metadata of populations used. 

The teosinte dataset consisted of 63 populations of between 2 and 30 diploid 



individuals genotyped at 978 biallelic, variant SNP loci (Fang et al. , 2012). Each 



population was associated with a latitude, longitude, and elevation at the point of 



samphng (see Figure S2 and Table SI). Pairwise geographic great-circle distances 
and ecological distances were calculated for all pairs of populations, where ecolog- 
ical distance was defined as the difference in elevation between populations. Both 
pairwise distance variables were normalized by their standard deviations. 

The human dataset was the Eurasian subset of that available from the HGDP 



(Conrad et al. , 2006 Li et al. , 2008), consisting of 33 populations of between 6 



and 45 individuals genotyped at 1000 biallelic, variant SNP loci (see Figure S3 
and Table S2). Pairwise geographic great-circle distances and ecological distances 
were calculated for all pairs of populations, where ecological distance was defined 
as or 1 if the populations were on the same or opposite side of the Himalaya 
mountain range, respectively. The western edge of the Himalaya was defined at 
75° East. 

For comparison, the method was run on each of the two datasets both with 
and without the "beta-binomial" overdispersion model. MCMC marginal traces 
were examined visually to assess convergence on a stationary distribution. The 
chain was thinned by sampling every 1000 generations, and the median and 95% 
credible sets were reported on the marginal distribution after a burn-in of 20%. 
The MCMC analysis for the teosinte dataset without the overdispersion model was 
run for 10 million generations; the analysis with the overdispersion model was run 
for 15 million generations. For the HGDP dataset, the numbers of generations were 
25 million and 35 million, for the analyses without and with the overdispersion 
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model, respectively. 



Results 

Simulation Results 

As described above, we conducted four sets of simulation analyses. The perfor- 
mance of the method in inference of the parameters of greatest interest is given 
below. 



First we note that, consistent with the results of (Guillot and Rousset, 2012), 
the spatial autocorrelation in our ecological variable caused the partial Mantel 
to have a high false positive rate when ue = 0, which suggests that the partial 
Mantel test is not well calibrated to assess the significance of ecological distance 
on patterns of genetic differentiation. At a significance level oi p = 0.05, the 
false positive rate for the binary ecological distance variable was 16%, and for the 
continuous ecological variable, the false positive error rate was 22% (see Figure 



S4j). 

The precision and accuracy results for the simulations with a continuous and 
discrete ecological variable are visualized in Figure panels |2]a and 6, respectively, 
across the six simulated values of the ratio aE/ao- Median precision, accuracy, 
and coverage are reported in Table [ij 
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Figure 2: a) Performance of the method for the 100 datasets simulated with a 
continuous ecological distance variable. Panel 1 depicts performance on the 50 
datasets for which aE was fixed at 0, and panel 2 depicts performance on the 50 
datasets for which qe varied, b) Performance of the method for the 100 datasets 
simulated with a binary ecological distance variable. Panel 1 depicts performance 
on the 50 datasets for which qe was fixed at 0, and panel 2 depicts performance 
on the 50 datasets for which aE varied. 





Sim Study lA 


Sim Study IB 


Sim Study 2A 


Sim Study 2B 


Precision 


0.041 


0.30 


0.15 


0.96 


Accuracy 


0.013 


0.0066 


0.031 


0.033 


Coverage 


NA 


94% 


NA 


94% 



Table 1: Simulation Studies lA and IB were conducted with a continuous ecolog- 
ical variable and aE = and a^; > 0, respectively. Simulation Studies 2A and 
2B were conducted with a binary ecological variable and aE = and > 0, 
respectively. Coverage is not reported for the simulations in which the effect size 
of the ecological distance variable was fixed to zero {aE = 0), as the parameter 
value used to generate the data is on the prior bound on aE, and coverage was 
therefore zero. 
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Empirical Results 
Teosinte Results 

For the Zea mays SNP dataset analysis, the mean and median of the posterior ratio 
of the effect size of pairwise difference in elevation to the effect size of pairwise 
geographic distance (i.e.- the aE '■ ratio) was 0.153, and the 95% credible set 
was 0.137 to 0.171 (see Figure [S7|a). The interpretation of this ratio is that 
one thousand meters of elevation difference between two populations has a similar 
impact on genetic differentiation as around 150 (or, 137-171) kilometers of lateral 
distance. 

Accounting for overdispersion (using the beta-binomial model) we obtain slightly 
different results, with a mean and median aE '■ ratio of 0.205, and a 95% cred- 
ible set from 0.180 to 0.233 (1,000 meters difference in elevation ^ 205 kilometers 



lateral distance, see Figure S7b). Values of our F statistics estimated across 



populations ranged from 2x10 to 0.53. Population values of Fk are shown on the 



map in Supplemental Figure S2 The five populations with the highest estimated 
Fk values are labeled as (in order from highest to lowest Fk) 10 km S of Degollado, 
La Estancia, Zoquiapan, Ayotlan, and Taretan (La Perimera). 

The posterior predictive sampling on the posterior parameter distributions of 
the last 5 million generations of the analyses indicates a better fit to the data with 
the beta-binomial extension (see Figure [3]a and b); the mean Pearson's product 
moment correlation between the posterior predictive datasets and the observed 
data without the beta-binomial extension was 0.87, while the mean correlation 
with the beta-binomial model was 0.95 (see Figure [Sl]a). The ability of the model 



to predict specific pairwise population Fst is shown Figure S5 
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Posterior Predictive Sampling of iBD in Zea mays 




Posterior Predictive Sampling of IBD in Zea mays 
Beta-Binomial Extension 
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Figure 3: Posterior predictive sampling with 1,000 simulated datasets, using pair- 
wise FsT as a summary statistic of the allelic count data for: a) the Teosinte 
dataset, using the standard model; b) the Teosinte dataset, using the overdisper- 
sion model; c) HGDP dataset, standard model, d) HGDP dataset, overdispersion 
model. 



HGDP results 

For the human (HGDP) SNP dataset analysis, the mean posterior ue '■ ao ratio 
was 5.13 X 10*^, the median was 5.00 x 10^, and the 95% credible set was 3.09 x 10'^ 



to 7.85 X 10^ (see Figure S8a). The interpretation of this ratio is that, for two 



24 



populations, being on the opposite side of the Himalaya mountain range has the 
impact of between approximately 35 and 83 thousand kilometers of extra pairwise 
geographic distance on their genetic differentiation. 

Implementing the beta-binomial extension of this method on the same dataset 
yields significantly different results, with a mean ue '■ old ratio of 1.35 x 10*^, a 
median of 1.34 x lO'', and a 95% credible set from 1.09 x 10'' to 1.65 x lO'' (see 



Figure [S8p). This result is broadly consistent with that of (Rosenberg, 2011), 
who found an effect size ratio of 9.52 x 10^ in a linear regression analysis that 
treated pairwise population comparisons as independent observations. Values of 
Fk estimated across populations ranged from 3.2 x 10^^ to 0.06. Population values 



of Fk are shown on the map in Figure S3 The five human populations with the 



highest estimated F^ values are (in order from highest to lowest F^) the Kalash, 
the Lahu, the Mozabites, the Hazara, and the Uygur. 

The posterior predictive sampling on the posterior parameter distributions of 
the last 5 million generations of the analysis indicates a better fit to the data with 
the beta-binomial extension (see Figure [3]c and d)] the mean Pearson's product 
moment correlation between the posterior predictive datasets and the observed 
data without the beta-binomial extension was 0.89, while the mean correlation 



with the beta-binomial model was 0.92 (see Figure SI?)). The ability of the model 



to predict specific pairwise population Fst is shown in Figure S6 



Discussion 

In this paper, we have presented a method that uses raw allelic count data to 
infer the relative contribution of geographic and ecological distance to genetic 
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distance between sampled populations, and our results on its performance as ap- 
plied to simulated and empirical datasets are encouraging. This area of research 



has received recent attention (see, for example (Wang et al. , 2012) for an exciting 
multiple- matrix regression approach in a structural equation modeling framework); 
however, we feel that our method has broad utility to the field of landscape ge- 
netics and to studies of local adaptation, and holds a number of advantages over 
existing methods. It allows users to simultaneously quantify effect sizes of ge- 
ographic distance and ecological distance (rather than assessing the significance 
of a correlation once the effect of geography has been removed, as in the par- 
tial Mantel test). Explicitly modeling the covariance in allele frequencies allows 
users to accommodate non-independence in the data, and the method's Bayesian 
framework naturally accommodates uncertainty and provides a means of evaluat- 
ing model adequacy. In addition, the basic model presented here - a parametric 
model of spatial covariance in allele frequencies - is extremely versatile, allowing 
for the inclusion of multiple ecological or geographic distance variables, as well as 
great flexibility in the function used to model the covariance. 

Simulation Study 

Our method performed well in all four simulation studies (see Figure |2] and Table 
[T|. Accuracy in all cases was high, indicating that the method is able to effec- 
tively recognize and indicate when a ecological variable contributes significantly 
to genetic differentiation. This is in contrast to the partial Mantel, which has a 
high false positive rate in the presence of spatial autocorrelation of environmental 



variables (see Figure S4) 
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Coverage, accuracy, and precision were all satisfactory (see Table [T|. The 
precision of our estimator of was generally lower for our discrete ecological 
variable, likely due to the strong spatial structure of the discrete ecological variable. 

An issue we observed in practice is that at some parameter values, different 
combinations of a are essentially nonidentifiable — the form ^{aEEij + aoDij)"^'^ 
sometimes allows equally reasonable fits at different values of 0:2, or at different 
combinations of ao and ao^E- However, in such cases the ue '■ an ratio, which is 
the real parameter of interest, remains constant across the credible region, even 
as aE and an change together to compensate for changing ao,2- Such 'ridges' in 
the likelihood surface are readily diagnosed by viewing the joint marginals of the 
a parameters. 



Empirical Results 
Teosinte 

The application of our method to the teosinte SNP dataset indicated that difference 
in elevation has a potentially substantial contribution to genetic differentiation 
between teosinte populations. Difference in elevation could be correlated with 
another, as yet unmeasured ecological variable, so we cannot claim to report a 
causal link, but these results are certainly suggestive, especially in the light of 



the research on morphological adaptations in teosinte to high altitude (Eagles and 



Lothrop 1994). 



The analysis of the teosinte SNP data with the beta-binomial extension of our 
method shows a much better model fit, and highlights a number of populations 



with particularly high values. These populations (highlighted in Figure S2 ) all 
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belong to the subspecies Zea mays mexicana, which primarily occurs at higher alti- 
tudes and is hypothesized to have undergone significant drift due to small effective 



population sizes or bottlenecks (Fukunaga et al. , 2005). In addition, a number 



of these populations occur in putative hybrid zones between Zea mays mexicana 



and Zea mays parviglumis, a separate, co-occuring subspecies (Heerwaarden et al. 



2011). Like drift, admixture would have the effect of increasing the variance in 
observed allele frequencies around the expectation derived from the strict geo- 
graphic/ecological distance model, and would drive up the inferred Fk parameters 
for admixed populations. 

HGDP 

In the Human Genome Diversity Panel data we find a strong effect of separation by 



the Himalayas on genetic differentiation, confirming previous results (Rosenberg 



et al. , 2005 ). To obtain a good fit to the data it is necessary to model overdispersion 
(with the beta-binomial extension). This lack of model fit of the basic model can 
be seen in the posterior predictive sampling in Figure [3]c and d, which highlights 
the importance of assessing model adequacy during analysis. With overdisper- 
sion included, the model appears to describe the data well, suggesting substantial 
heterogeneity beyond that dictated by geographic distance and separation by the 
Himalayas between the sampled populations. A number of populations stand out 
in their Fk values, in particular the Kalash, the Lahu, the Mozabites, the Hazara, 



and the Uygur (highlighted in Figure S3). This is consistent with the known his- 



tory of these populations and previous work on these samples (Rosenberg et al 



2002), which suggests that these populations are unusual for their geographic posi- 



tion (that is, they depart from expectations of their covariance in allele frequencies 
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with their neighbors). The Hazara and Uygur populations are known to be recently 
admixed populations between central Asian and East ancestry populations. The 
Mozabite population has substantial recent admixture from African populations 



(Rosenberg et al. 2002 Rosenberg 2011 ). The Kalash, who live in northwest Pak- 



istan, are an isolated population with low heterozygosity, suggesting a historically 
small effective population size. Finally the Lahu have unusually low heterozygosity 
compared to the other East Asian populations, suggesting that they too may have 
had an unusually low effective population size. Thus our beta-binomial model, in 
addition to improving the fit to the data, is successfully highlighting populations 
that are outliers from simple patterns of isolation by distance. 

Population-specific drift 

As noted above, in both empirical datasets analyzed, the beta-binomial extension 
to the basic model offers substantially better model fit. This could in part reflect 
ecological variables not included in the analyses, in addition to heterogeneity in 
demographic processes, both of which could shape genetic variation in these popu- 
lations by pushing population allele frequencies away from their expectations under 
our simple isolation by distance and ecology model. Our Fk statistic provides a 
useful way to highlight populations that show the strongest deviations away from 
our model, and to prevent these deviations from obscuring environmental correla- 
tions or causing spurious correlations. Therfore, we reccommend that the extended 
model be used as the default model for analyses. 

We do note, however, that these parameters do not provide a way to dis- 
tinguish between different explanations of the over-dispersion, and should be in- 
tepreted with caution. Also, care must be taken in interpreting the F^ when there 
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are strong differences in the sample sizes across populations or regions. Due to 
our likelihood-based framework;, when there is heterogeneity in sample size among 
populations, those populations or regions that are poorly sampled will naturally 
be weighted less when determining the shape of the covariance function, and so 
are more likely to be forced to have higher Fk than populations and regions with 
larger sample sizes, which will tend to receive greater accommodation in the de- 
termination of the shape of the covariance curve. Given these issues, we think the 
best application of the Fk statistics is to highlight populations that are experienc- 
ing or have experienced processes not accommodated in the geographic/ecological 
covariance structure of the majority of the sample and ear-mark them for further 
examination. 



Limitations 

The flexibility of this statistical model is accompanied by computational expense. 
Depending on the number of loci and populations in a dataset, as well as the 
number of MCMC generation required to accurately describe the stationary dis- 
tribution, analyses can take anywhere from hours to days. Speedups could be 
obtained by parallelization or porting code to C. In addition, as with any method 
that employs an MCMC algorithm, users should take care to assess MCMC perfor- 
mance to ensure that the chain is mixing well, has been run for a sufficient number 



of generations, and has converged on a stationary distribution (Gilks et al. , 1996). 
Users are well advised to run multiple independent chains from random initial lo- 
cations in parameter space, and to compare the output of those analyses to confirm 
that all are describing the same stationary distributions. 
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Extensions 

The flexibility of this method translates well into extendability. Among a number 
of natural extensions the community might be interested in implementing, we 
highlight a few here. 

One natural extension is to incorporate different definitions of the ecological 
distance between our populations. Just because two populations have no differ- 
ence in their ecological variable state does not guarantee that there is not great 
heterogeneity in the distance between them. For example, a pair of populations 
separated by the Grand Canyon might have nearly identical elevations, but the 
cost to migrants between them incurred by elevation may well be significant. One 
solution to this would be to enter a simple binary barrier variable, or to calculate 
least-cost paths between populations, and use those distances in lieu of geographic 
distance. A more elegant solution would be to use "isolation by resistance" dis- 
tances, obtianed by rasterizing landscapes and employing results relating mean 
passage rates of random walks in a heterogeneous environment to quantities from 
circuit theory in order to calculate the conductance (ease of migration) between 



nodes on that landscape (McRae and Beier, 2007). This method has the advan- 



tage of integrating over all possible pathways between populations. Currently, 
users must specify the resistance of landscape elements a priori, but those resis- 
tance parameters could be incorporated into our parametric covariance function, 
and estimated along with the other parameters of our model in the same MCMC. 
This approach carries great appeal, as it combines the conceptual rigor of accom- 
modating multiple migration paths with the methodological rigor of statistically 
estimated, rather than user-specified, parameter values. 
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Another extension is the further relaxation of the assumption of process ho- 
mogeneity in decay of allehc covariance over geographic and ecological distance. 
Specifically, the method currently requires that a single unit of pairwise ecological 
distance translate into the same extent of pairwise genetic differentiation between 
all population pairs. This assumption is unlikely to be realistic in most empirical 
examples, especially if populations are locally adapted. For example, individuals 
from populations adapted to high elevation may be able to migrate more eas- 
ily over topography than individuals from populations adapted to low elevations. 
Such heterogeneity could be accommodated by using different covariance functions 
for different, pre-specified population pairings. 

A final extension that could be integrated into this method is a model selection 
framework, in which models with and without an ecological distance variable, 
or with different combinations of ecological distance variables, can be rigorously 
compared. Because our method is implemented in a Bayesian framework, we 
could select between models by calculating Bayes factors (the ratio of the marginal 



likelihoods of the data under two competing hypotheses) (Dickey, 1971 Verdinelli 



and Wasserman , 1995 ) . This approach would seem to offer the best of both worlds: 



robust parameter inference that accommodates uncertainty in addition to output 
that could be interpreted as definitive evidence for or against the association of an 
ecological variable of interest with genetic differentiation between populations. 



Conclusion 

In closing, we present a tool that can be useful in a wide variety of contexts, 
allowing a description of the landscape as viewed by the movements of genetic ma- 



32 



terial between populations. We urge users to be cautious in their interpretation 
of results generated with this model. A correlation between genetic differentiation 
and an ecological distance variable does not guarantee a causal relationship, espe- 
cially because unmeasured ecological variables may be highly correlated with those 
included in an analysis. In addition, evidence of a correlation between genetic dif- 
ferentiation and an ecological variable may not be evidence of local adaptation 
or selection against migrants, as both neutral and selective forces can give rise to 
an association between genetic divergence and ecological distance. Finally, we are 
making the source scripts for this method available online at genescape.org, and we 
hope that users elaborate on the framework we present here to derive new models 
that are better able to describe empirical patterns of isolation by distance — both 
geographic and ecological. 
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Supplemental material 



Comparing Posterior Predictive Fit 
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Figure SI: Distribution of Pearson's correlations between each posterior predictive 
simulated dataset and the observed data, highlighting the improved fit of the 
overdispersion model to describe: a) the Teosinte dataset; b) the HGDP dataset. 
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Population name (sample size) 


Latitude 


Longitude 


Elevation 


Subspecies 


1 - 


Km 1 El Crustel-Teloloapan (44) 


18, 


.383 


18, 


.383 


985 


parviglumis 


2 - 


Amates Grandes (50) 


18, 


.388 


18, 


.388 


1110 


parviglumis 


3 - 


Km 3 Amates Grandes-Teloloapan (48) 


18 


.394 


18 


.394 


1210 


parviglumis 


4 - 


Km 72 Iguala- Arcclia (Km Alcholoa-Arcelia) (56) 


18 


.414 


18 


.414 


1506 


parviglumis 


5 - 


Rincn del Sauce (56) 


18 


.35 


18 


.35 


1624 


parviglumis 


6 - 


Ahuacatitln (km 1.5 del entronque) (38) 


18 


.356 


18, 


.356 


1528 


parviglumis 


7 - 


Km 80 HuBtamo- Villa Madero (60) 


19, 


.063 


19, 


.063 


832 


parviglumis 


8 - 


Puerto de la Cruz (Km 119 Huetamo-V. Madero) (40) 


18, 


.963 


18, 


.963 


870 


parviglumis 


9 - 


El Zapote (km 122 Huctamo-Caracuaro) (50) 


18, 


.938 


18, 


.938 


915 


parviglumis 


10 


- Puerto El Coyote (40) 


18, 


.916 


18, 


.916 


727 


parviglumis 


11 


- Km 135-136 Huetamo- Villa Madero (40) 


18, 


.9 


18, 


.9 


677 


parviglumis 


12 


- Cuirindalillo (km 142 Huetamo-Caracuaro) (42) 


18, 


.883 


18, 


.883 


697 


parviglumis 


13 


- Crucero Puertas de Chiripio (50) 


18 


.794 


18, 


.794 


653 


parviglumis 


14 


- Quenchendio (km 151.5 Zitcuaro-Huctamo) (54) 


18 


.805 


18 


.805 


635 


parviglumis 


15 


- El Potrero (km 145.5 Zitcuaro-Huctamo) (40) 


18 


.82 


18 


.82 


654 


parviglumis 


16 


- La Crucita (km 13.5 Zitcuaro-Huctamo) (58) 


18 


.858 


18 


.858 


609 


parviglumis 


17 


- El Ciuayabo (km 132. .5 Zitcuaro-Huetamo) (64) 


18 


.862 


18 


.862 


555 


parviglumis 


18 


- Km 107-108 Toluca-Altamirano (50) 


18 


.899 


18 


.899 


1422 


parviglumis 


19 


- Km 112 Toluca-Altamirano (46) 


18, 


.896 


18, 


.896 


1356 


parviglumis 


20 


- Km 119 Toluca-Altamirano (38) 


18, 


.854 


18, 


.854 


1015 


parviglumis 


21 


- Salitre-Monte de Dios (46) 


18 


.842 


18, 


.842 


958 


parviglumis 


22 


- Tarctan (La Perimcra) (36) 


19 


.344 


19 


.344 


1170 


par\'iglumis 


23 


- Los Guajes (km 43 Zitcuaro-Huetamo) (54) 


19 


.231 


19 


.231 


985 


par\'iglumis 


24 


- 1 Km Norte de Santa Ana (54) 


19 


.281 


19, 


.281 


1332 


parviglumis 


25 


- Km 8 Zuluapan-Tingambato (58) 


19, 


.148 


19, 


.148 


1178 


parviglumis 


26 


- Km 4 Zuluapan-Tingambato (60) 


19 


.146 


19, 


.146 


1346 


parviglumis 


27 


- K2 Zacazonapan-Otzoloapan (56) 


19 


.079 


19, 


.079 


1468 


parviglumis 


28 


- K22 Zacazonapan-Luvianos (EL Puente) (66) 


19 


.039 


19, 


.039 


1085 


parviglumis 


29 


- Acatitln-El Puente (50) 


19 


.029 


19, 


.029 


1075 


parviglumis 


30 


- Queretanillo (56) 


19, 


.551 


19, 


.651 


1342 


parviglumis 


31 


- Km 33.5 Temascal-Huetamo (56) 


19, 


.483 


19, 


.483 


1100 


parviglumis 


32 


- Km 37 Temascal-Huetamo (40) 


19 


.464 


19, 


.464 


1030 


parviglumis 


33 


- Casa Blanca (km 62 Huetamo- Villa Madero) (64) 


19 


.161 


19 


.161 


1268 


parviglumis 


34 


- San Antonio Tecomitl (4) 


19 


.217 


19 


.217 


2400 


mexicana 


35 


- Ozumba (4) 


19 


.05 


19 


.05 


2340 


mexicana 


36 


- Temamatla (6) 


19 


.183 


19 


.183 


2400 


mexicana 


37 


- Zoquiapan (4) 


19 


.317 


19, 


.317 


2270 


mexicana 


38 


- Los Reyes La Paz (6) 


19, 


.4 


19, 


.4 


2200 


mexicana 


39 


- Miraflorcs (4) 


19, 


.217 


19, 


.217 


2200 


mexicana 


40 


- Tepetlixpa (4) 


19 


.017 


19 


,017 


2320 


mexicana 


41 


- El Pcdrcgal (4) 


19 


.267 


19 


.267 


2500 


mexicana 


42 


- Mexicaltzingo (4) 


19, 


.217 


19 


.217 


2600 


mexicana 


43 


- Santa Cruz (4) 


19, 


.083 


19, 


.083 


2425 


mexicana 


44 


- San Antonio (4) 


19 


.067 


19, 


.067 


2440 


mexicana 


45 


- San Salvador (4) 


19 


.133 


19 


.133 


2425 


mexicana 


46 


- Tlachichuca (4) 


19 


.167 


19 


.167 


2355 


mexicana 


47 


- K3 San Salvador El Seco-Coatepcc (4) 


19 


.117 


19, 


.117 


2425 


mexicana 


48 


- San Nicolas B. Aires (4) 


19, 


.167 


19, 


.167 


2355 


mexicana 


49 


- San Felipe (4) 


19, 


.517 


19, 


.517 


2250 


mexicana 


50 


- 4 miles N of Hidalgo, Arroyo Zarco (4) 


19, 


.7 


19, 


.7 


2040 


mexicana 


61 


- 5-7 km SW Cojumatlan (4) 


20 


.1 


20 


.1 


1700 


mexicana 


52 


- Puente Gavilanes (4) 


24, 


.017 


24, 


.017 


1950 


mexicana 


53 


- La Estancia (4) 


21 


.5 


21 


.5 


1920 


mexicana 


54 


- Morolcon (4) 


20, 


.083 


20, 


.083 


2100 


mexicana 


55 


- Pinicuaro (8) 


20, 


.05 


20, 


.05 


2087.6 


mexicana 


56 


- Puruandiro (4) 


20, 


.083 


20 


.083 


2000 


mexicana 


57 


- km 2 Puruandiro-Las Tortugas (4) 


20 


.117 


20, 


.117 


1880 


mexicana 


58 


- 10 km S of Degollado (4) 


20 


.367 


20 


.367 


1625 


mexicana 


59 


- Ayotlan (4) 


20 


.417 


20 


.417 


1520 


mexicana 


60 


- Churitzio (8) 


20 


.175 


20 


.175 


1780 


mexicana 


61 


- El Salitre 1-2 km SE (4) 


20, 


.183 


20, 


.183 


1630 


mexicana 


62 


- Rancho El Tejocote (4) 


20, 


.167 


20, 


.167 


1750 


mexicana 


63 


- Villa Escalante (6) 


19, 


.4 


19, 


.4 


2320 


mexicana 



Table SI: Metadata for populations used in the teosinte dataset. 
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Side of the Himalayas 


1 - rtaygei lou; 


44 


39 












o - itaiia,!! yAyj ^ 


46 


10 


W 


4 - French (52) 


46 


2 


W 


6 - Orcadian (28) 


69 


-3 


W 


6 - Russian (46) 


61 


40 


W 


7 - Sardinian (46) 


40 


9 


W 


8 - Tuscan (10) 


43 


11 


W 


9 - Bedouin (86) 


31 


35 


W 


10 - Druze (78) 


32 


35 


W 


11 - Mozabite (50) 


32 


3 


W 


12 - Palestinian (88) 


32 


35 


W 


13 - Balochi (44) 


30.5 


66.5 


W 


14 - Brahui (46) 


30.5 


66.5 


W 


15 - Burusho (48) 


36.5 


74 


W 


16 - Hazara (40) 


33.5 


70 


W 


17 - Kalash (44) 


36 


71.5 


W 


18 - Malcrani (48) 


26 


64 


W 


19 - Pathan (40) 


33.5 


70.5 


W 


20 - Sindhi (44) 


25.5 


69 


W 


21 - Cambodian (16) 


12 


105 


E 


22 - Dai (18) 


21 


100 


E 


23 - Daur (14) 


48.5 


124 


E 


24 - Han (64) 


32.5 


114 


E 


25 - Hczhcn (16) 


47.5 


133.5 


E 


26 - Japanese (50) 


38 


138 


E 


27 - Lahu (12) 


22 


100 


E 


28 - Miao (10) 


28 


109 


E 


29 - Mongola (18) 


48.5 


119 


E 


30 - Naxi (14) 


26 


100 


E 


31 - Oroqcn (16) 


50.5 


126.5 


E 


32 - She (18) 


27 


119 


E 


33 - Tu (18) 


36 


101 


E 


34 - Tujia (18) 


29 


109 


E 


35 - Uygur (18) 


44 


81 


E 


36 - Xibo (Ifi) 


43.5 


81.5 


E 


37 - Yakut (46) 


63 


129.5 


E 


38 - Yi (18) 


28 


103 


E 



Table S2: Metadata for populations used from the HGDP dataset. 
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Figure S2: Map of teosinte populations sampled, colored by their median estimated 
population-specific overdispersion parameter, F^. The five populations with the 
highest values are noted. 
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Figure S3: Map of human populations included in the analysis, colored by their 
median estimated population-specific overdispersion parameter, Fk. The five pop- 
ulations with the highest values are noted. The dashed line denotes the line of 
longitude used to delimit the Himalayas. 
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Histogram of partial IVIantel test p-values 

Continuous Ecological Variable Binary Ecological Variable 



a) 



I 1 1 1 1 1 

0.0 0.2 0.4 0.6 0.8 1.0 



o 
c 
o 

a- 
o 



CO - 



cvi 



b) 



False positives 



I 1 1 1 1 1 

0.0 0.2 0.4 0.6 0.8 1.0 



P-value 



P-value 



Figure S4: Histograms of p-values produced by the partial Mantel test (with 
1,000,000 permutations) on the 100 datasets for which the true contribution of 
ecological distance to genetic differentiation was zero. The black column indicates 
the type I error rate with a significance level of p=0.05 in: a) the datasets with 
a continuous ecological distance variable; b) the datasets with a binary ecological 
distance variable. 
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Model fit - Teosinte Data 
a) Standard Model b) Overdispersion Model 




Figure S5: Heatmapped matrices showing the performance of the model at all 
pairwise population comparisons. The posterior predictive p-value was defined as 
1 — 2 X |0.5 — ecdf{FsTa^J\, in which ecdf {Fst^^J is the empirical cumulative prob- 
ability of the observed Fst between two populations from a distribution defined 
by the posterior predictive sample for that population comparison, representing 
the p-value of a two-tailed t-test. Higher p- values indicate better model fit. Popu- 
lations are enumerated on the margins, and may be referenced in SuppMat Table 
1. a) The standard model, b) The overdispersion model. 
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Model fit -HGDP Data 




Figure S6: Heatmapped matrices indicating the performance of the model at all 
pairwise population comparisons. The posterior predictive p-value was defined as 
1 — 2 X |0.5 — ecdf{FsTa^J\, in which ecdf {Fst^^J is the empirical cumulative prob- 
ability of the observed Fst between two populations from a distribution defined 
by the posterior predictive sample for that population comparison, representing 
the p-value of a two-tailed t-test. Higher p- values indicate better model fit. Popu- 
lations are enumerated on the margins, and may be referenced in SuppMat Table 
2. a) The standard model, b) The overdispersion model. 
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MCMC Trace File - Teosinte Dataset 

standard Model 




1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 7e+06 8e+06 9e+06 1e+07 

Generations 



MCMC Trace File - Teosinte Dataset 

Overdispersion Model 




T 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

1e+06 3e+06 5e+06 7e+06 9e+06 1.1e+07 1.3e+07 1.5e+07 

Generations 



Figure S7: Trace plots of the marginal posterior estimates for the aE/cxD ratio from 
MCMC analysis of the teosinte dataset. Inset figures give the marginal densities 
and 95% credible set for the samples after a burn-in of 20% a) The standard 
model, b) The overdispersion model. 
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MCMC Trace File - HGDP Dataset 

standard Model 



sample from marginal posterior 
mecjian value 
95% crerjible set 




~i — I — I — I — I — I — r 
2e+06 5e+06 



T — I — r 
1.1e+07 



1 — I — r 

1 .4e+07 



T — I — r 

1.7e+07 



"1 — I — I — I — I — I — r 
2e+07 2.3e+07 



MCMC Trace File - HGDP Dataset 

Overdispersion Model 




sample from marginal posterior 
median value 
95% credible set 



T — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — I — r 
2e+06 5e+06 8e+06 1.2e+07 1.6e+07 2e+07 2.4e+07 2.8e+07 3.2e+07 



Figure S8: Trace plots of the marginal posterior estimates for the aE/cxD ratio from 
MCMC analysis of the HGDP dataset. Inset figures give the marginal densities 
and 95% credible set for the samples after a burn-in of 20% a) The standard 
model, b) The overdispersion model. 
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